[Neurips23] NGT Submission for OOD track #187

Merged: 7 commits merged into harsha-simhadri:main on Nov 3, 2023


masajiro
Contributor

This is an improved version of NGT ONNG. Please check it out.

@harsha-simhadri
Owner

harsha-simhadri commented Oct 27, 2023

@masajiro thanks for the PR. I am not able to run this to completion. I am attaching the trailing log from my VM. Could you please help me debug this?

# of the processed objects=7598940 VM size=14.09 G Peak VM size=14.09 G Time=2.40146 (h)
# of the processed objects=7770974 VM size=14.16 G Peak VM size=14.16 G Time=2.42843 (h)
# of the processed objects=7964442 VM size=14.25 G Peak VM size=14.25 G Time=2.45806 (h)
# of the processed objects=7969444 VM size=14.25 G Peak VM size=14.25 G Time=2.45881 (h)
# of the processed objects=8236556 VM size=14.38 G Peak VM size=14.38 G Time=2.49862 (h)
# of the processed objects=8304888 VM size=14.41 G Peak VM size=14.41 G Time=2.50865 (h)
Traceback (most recent call last):
  File "/home/app/run_algorithm.py", line 3, in <module>
    run_from_cmdline()
  File "/home/app/benchmark/runner.py", line 228, in run_from_cmdline
    run(definition, args.dataset, args.count, args.runs, args.rebuild,
  File "/home/app/benchmark/runner.py", line 71, in run
    build_time = (custom_runner.build(algo,dataset)
  File "/home/app/benchmark/algorithms/base_runner.py", line 7, in build
    algo.fit(dataset)
  File "/home/app/neurips23/ood/ngt/module.py", line 141, in fit
    subprocess.run(args, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ngt', 'construct-graph', '-v', '-Go', '-T0', '-P0', '-N140', '-O10', '-I180', 'data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140', 'data/indices/ood/ngt/index-140-10-180-0.10-0.39/sanng']' died with <Signals.SIGKILL: 9>.
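
For context on the trailing `died with <Signals.SIGKILL: 9>`: on POSIX, `subprocess` reports a child killed by a signal as a negative return code, and on Linux a SIGKILL during a memory-heavy build is often (though not always) the kernel's out-of-memory killer. A minimal sketch of how a wrapper could surface that distinction follows. `run_and_explain` is a hypothetical helper for illustration, not part of `module.py`, and the child command is a stand-in for the `ngt` call.

```python
import signal
import subprocess
import sys


def run_and_explain(args):
    """Run a command; return None on success or a short failure description."""
    try:
        subprocess.run(args, check=True)
    except subprocess.CalledProcessError as err:
        if err.returncode < 0:
            # On POSIX a negative return code is the negated signal number.
            name = signal.Signals(-err.returncode).name
            hint = " (possibly the OOM killer; check dmesg)" if name == "SIGKILL" else ""
            return f"killed by {name}{hint}"
        return f"exited with status {err.returncode}"
    return None


if __name__ == "__main__":
    # Stand-in for the ngt command: a child process that SIGKILLs itself.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_explain(child))
```

Whether the OOM killer was actually responsible can only be confirmed from the kernel log on the machine that ran the build.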

@masajiro
Contributor Author

Thanks for checking. We have confirmed that this program runs on Standard D8lds v5. The failure might be because other processes are using memory, leaving too little for this program to run.

@masajiro
Contributor Author

When you run it again, the half-finished index data/indices/ood/ngt must be removed.

@harsha-simhadri
Owner

> When you run it again, the half-finished index data/indices/ood/ngt must be removed.

OK, I was trying to debug a repeated crash and realized this was happening :) Anyway, thanks for the warning. Is it possible to set up your submission to clean up after crashes? It would make the overall benchmark framework easier to rerun/replicate in the future. Thanks!
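
One way such cleanup could look, as a sketch only: wrap each build step so that its output directory is removed when the step fails. `run_step` and the `onng` directory used in the demo are hypothetical names for illustration; the submission's actual `fit` code is not shown in this thread.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(args, output_dir):
    """Run one build step; delete its output directory if the step fails."""
    try:
        subprocess.run(args, check=True)
    except BaseException:
        # Covers non-zero exits, children killed by a signal, and Ctrl-C:
        # drop the partial output so the next run starts from a clean state.
        shutil.rmtree(output_dir, ignore_errors=True)
        raise


if __name__ == "__main__":
    partial = Path(tempfile.mkdtemp()) / "onng"  # hypothetical output path
    partial.mkdir()
    try:
        # Stand-in for a failing ngt command.
        run_step([sys.executable, "-c", "raise SystemExit(1)"], partial)
    except subprocess.CalledProcessError:
        print("step failed; partial output removed:", not partial.exists())
```

This only helps when the child process is the one that dies. If the Python process itself is killed, the handler never runs, so a check for stale directories at startup would still be needed.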

@masajiro
Contributor Author

I would like to know the exact available memory of your Azure server. According to the following commands, the available memory of our Azure server (Standard D8lds v5) is about 15.4 G.

$ free
               total        used        free      shared  buff/cache   available
Mem:        16362296      557696      177124        4452    15627476    15445464
Swap:              0           0           0
$ grep Mem /proc/meminfo
MemTotal:       16362296 kB
MemFree:          176876 kB
MemAvailable:   15445216 kB
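
The same reading can be taken programmatically, which makes it easier to log before a build. This is a minimal sketch, assuming the standard Linux `/proc/meminfo` layout shown above; `parse_meminfo` is a hypothetical helper name.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# The three lines quoted above, used as sample input.
SAMPLE = """\
MemTotal:       16362296 kB
MemFree:          176876 kB
MemAvailable:   15445216 kB
"""

info = parse_meminfo(SAMPLE)
available_kb = info["MemAvailable"]
print(available_kb)                   # 15445216
print(round(available_kb / 1e6, 1))   # 15.4, matching the "about 15.4 G" above
```

On a live Linux machine the input would come from `open("/proc/meminfo").read()`. Note that dividing by 10^6 reproduces the 15.4 figure, while dividing by 1024^2 gives about 14.73, so the unit convention matters when comparing against the build's reported VM size.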

@harsha-simhadri
Owner

I am the only user logged in and yours is the only job I ran. I did so last time as well. I see a similar problem again.

# of the processed objects=9690929 VM size=15.04 G Peak VM size=15.04 G Time=2.74366 (h)
Traceback (most recent call last):
  File "/home/app/run_algorithm.py", line 3, in <module>
    run_from_cmdline()
  File "/home/app/benchmark/runner.py", line 228, in run_from_cmdline
    run(definition, args.dataset, args.count, args.runs, args.rebuild,
  File "/home/app/benchmark/runner.py", line 71, in run
    build_time = (custom_runner.build(algo,dataset)
  File "/home/app/benchmark/algorithms/base_runner.py", line 7, in build
    algo.fit(dataset)
  File "/home/app/neurips23/ood/ngt/module.py", line 141, in fit
    subprocess.run(args, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ngt', 'construct-graph', '-v', '-Go', '-T0', '-P0', '-N140', '-O10', '-I180', 'data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140', 'data/indices/ood/ngt/index-140-10-180-0.10-0.39/sanng']' died with <Signals.SIGKILL: 9>.

After the crash, I see

harshasi@dnode2:~$ free
               total        used        free      shared  buff/cache   available
Mem:        16310996     1030380    13865604        4172     1415012    14921760
Swap:              0           0           0

harshasi@dnode2:~$ grep Mem /proc/meminfo
MemTotal:       16310996 kB
MemFree:        13865604 kB
MemAvailable:   14921744 kB

Could you please consider tuning down your index build to use 0.5 GB less DRAM to allow for fluctuations in available memory?
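
The requested 0.5 GB lines up with the gap between the two `MemAvailable` readings posted in this thread. A small worked check, using only those two numbers (the variable names are illustrative):

```python
# MemAvailable values (kB) from the two /proc/meminfo outputs in this thread.
contributor_kb = 15445216  # Standard D8lds v5, as posted by the contributor
owner_kb = 14921744        # dnode2, as posted after the crash

gap_kb = contributor_kb - owner_kb
print(gap_kb)                      # 523472
print(round(gap_kb / 1024**2, 2))  # 0.5
```

The two readings were taken on different machines in different states, so this is a rough comparison rather than a measurement of the build's requirement.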

@masajiro
Contributor Author

Thank you for the information. It is strange that the available memory differs by this much. Anyway, we will try to reduce the memory usage to less than 15.0 GB.

@masajiro
Contributor Author

The peak VM size of this latest commit is 14.98 G. Could you please check this out?

@harsha-simhadri
Owner

I see the following crash now (I did remove previous index files before launching the run):

# of the processed objects=9690927 VM size=14.85 G Peak VM size=14.85 G Time=2.62833 (h)
# of the processed objects=9939855 VM size=14.96 G Peak VM size=14.96 G Time=2.66665 (h)
Warning: not found the target node in the result. 9999742
search is finished.
VM size=14.95 G
Peak VM size=14.98 G
constructONNGFromANNG...
range=1:1000000
range=1000001:2000000
range=2000001:3000000
range=3000001:4000000
range=4000001:5000000
range=5000001:6000000
range=6000001:7000000
range=7000001:8000000
range=8000001:9000000
range=9000001:10000000
invalid graph type. o
aggregating...
range=1:1000000
range=1000001:2000000
range=2000001:3000000
range=3000001:4000000
range=4000001:5000000
range=5000001:6000000
range=6000001:7000000
range=7000001:8000000
range=8000001:9000000
range=9000001:10000000
aggregation is finished.
VM size=14.73 G
Peak VM size=14.98 G
concatinating...
# of groups=10
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_0
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_1
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_2
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_3
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_4
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_5
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_6
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_7
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_8
path=data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140/grp_9
concatination is finished.
VM size=14.73 G
Peak VM size=14.98 G
Successfully completed.
ONNG: degree ajustment time(sec)=10826.341586112976
ONNG: shortcut reduction
ONNG: 'ngt reconstruct-graph -v -R0.39 -mS -Ps -sp -o0 -i0 data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140 data/indices/ood/ngt/index-140-10-180-0.10-0.39/onng'
Traceback (most recent call last):
  File "/home/app/run_algorithm.py", line 3, in <module>
    run_from_cmdline()
  File "/home/app/benchmark/runner.py", line 228, in run_from_cmdline
    run(definition, args.dataset, args.count, args.runs, args.rebuild,
  File "/home/app/benchmark/runner.py", line 71, in run
    build_time = (custom_runner.build(algo,dataset)
  File "/home/app/benchmark/algorithms/base_runner.py", line 7, in build
    algo.fit(dataset)
  File "/home/app/neurips23/ood/ngt/module.py", line 160, in fit
    subprocess.run(args, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ngt', 'reconstruct-graph', '-v', '-R0.39', '-mS', '-Ps', '-sp', '-o0', '-i0', 'data/indices/ood/ngt/index-140-10-180-0.10-0.39/anng-140', 'data/indices/ood/ngt/index-140-10-180-0.10-0.39/onng']' died with <Signals.SIGKILL: 9>.

@masajiro
Contributor Author

I assume that the available memory is gradually decreasing. We don't see this situation on our servers. If you could delete the directory data/indices/ood/ngt/index-140-10-180-0.10-0.39/onng and rerun it, the index might be built.

@masajiro
Contributor Author

When you rerun it as mentioned above, please confirm that 15.0 G is available. The following command may help you find processes that are wasting memory.

$ ps aux --sort -rss

@masajiro
Contributor Author

@harsha-simhadri
We have adjusted the parameters to reduce the memory footprint further. Could you rerun it?

@masajiro
Contributor Author

The parameters are finalized.

@masajiro
Contributor Author

masajiro commented Nov 1, 2023

@harsha-simhadri
The last commit can reduce memory usage. Did you rerun this?

@maumueller
Collaborator

@harsha-simhadri It worked for me, but I was testing on c6i.2xlarge.

ngt,"ngt-onng(140, 10, 175, 0.38, 1.01)",text2image-10M,10,9657.805286277335,0.0,20896.91178059578,13527552.0,0,0,ood,0.8382860000000001
ngt,"ngt-onng(140, 10, 175, 0.38, 1.014)",text2image-10M,10,7304.094322075581,0.0,20896.91178059578,13527552.0,0,0,ood,0.8806309999999999
ngt,"ngt-onng(140, 10, 175, 0.38, 1.016)",text2image-10M,10,6036.177367051964,0.0,20896.91178059578,13527552.0,0,0,ood,0.898127
ngt,"ngt-onng(140, 10, 175, 0.38, 1.017)",text2image-10M,10,5483.480310633771,0.0,20896.91178059578,13527552.0,0,0,ood,0.905559
ngt,"ngt-onng(140, 10, 175, 0.38, 1.018)",text2image-10M,10,5029.412194620403,0.0,20896.91178059578,13527552.0,0,0,ood,0.9125439999999999
ngt,"ngt-onng(140, 10, 175, 0.38, 1.02)",text2image-10M,10,4310.272927582099,0.0,20896.91178059578,13527552.0,0,0,ood,0.925764
ngt,"ngt-onng(140, 10, 175, 0.38, 1.025)",text2image-10M,10,2688.7485516010056,0.0,20896.91178059578,13527552.0,0,0,ood,0.951459

The peak VM size seems to have been:

2023-11-02 15:53:59,939 - annb.dcfdf8e366c4 - INFO - Peak VM size=14.98 G
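
For reruns, the peak can be pulled out of a saved log rather than read by eye. A minimal sketch, assuming the `Peak VM size=<number> G` format seen in the logs in this thread; `max_peak_vm` is a hypothetical helper name.

```python
import re

# Matches e.g. "Peak VM size=14.98 G" and captures the number.
PEAK_RE = re.compile(r"Peak VM size=(\d+(?:\.\d+)?) G")


def max_peak_vm(log_text):
    """Return the largest 'Peak VM size' value in the log's G units, or None."""
    values = [float(m.group(1)) for m in PEAK_RE.finditer(log_text)]
    return max(values) if values else None


# Two lines copied from the logs in this thread.
log = """\
# of the processed objects=9939855 VM size=14.96 G Peak VM size=14.96 G Time=2.66665 (h)
2023-11-02 15:53:59,939 - annb.dcfdf8e366c4 - INFO - Peak VM size=14.98 G
"""
print(max_peak_vm(log))  # 14.98
```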

@maumueller
Collaborator

Merging this, thanks for your submission @masajiro.

@maumueller maumueller merged commit 2365b6c into harsha-simhadri:main Nov 3, 2023
21 of 26 checks passed
@masajiro
Contributor Author

masajiro commented Nov 3, 2023

@maumueller
Thanks!
